AITopics | asr error

asr error

Automatic Speech Recognition Biases in Newcastle English: an Error Analysis

Serditova, Dana, Tang, Kevin, Steffens, Jochen

arXiv.org Artificial Intelligence, Aug-26-2025

Automatic Speech Recognition (ASR) systems struggle with regional dialects due to biased training which favours mainstream varieties. While previous research has identified racial, age, and gender biases in ASR, regional bias remains underexamined. This study investigates ASR performance on Newcastle English, a well-documented regional dialect known to be challenging for ASR. A two-stage analysis was conducted: first, a manual error analysis on a subsample identified key phonological, lexical, and morphosyntactic errors behind ASR misrecognitions; second, a case study focused on the systematic analysis of ASR recognition of the regional pronouns "yous" and "wor". Results show that ASR errors directly correlate with regional dialectal features, while social factors play a lesser role in ASR mismatches. We advocate for greater dialectal diversity in ASR training data and highlight the value of sociolinguistic analysis in diagnosing and addressing regional biases.

artificial intelligence, newcastle english, speech recognition, (15 more...)

arXiv.org Artificial Intelligence

doi: 10.21437/Interspeech.2025-1973

2506.16558

Country: Europe > United Kingdom > England (0.94)

Genre: Research Report > New Finding (1.00)

Technology: Information Technology > Artificial Intelligence > Speech > Speech Recognition (1.00)

Causal Structure Discovery for Error Diagnostics of Children's ASR

Singh, Vishwanath Pratap, Sahidullah, Md., Kinnunen, Tomi

arXiv.org Artificial Intelligence, Jun-3-2025

Children's automatic speech recognition (ASR) often underperforms compared to that of adults due to a confluence of interdependent factors: physiological (e.g., smaller vocal tracts), cognitive (e.g., underdeveloped pronunciation), and extrinsic (e.g., vocabulary limitations, background noise). Existing analysis methods examine the impact of these factors in isolation, neglecting interdependencies--such as age affecting ASR accuracy both directly and indirectly via pronunciation skills. In this paper, we introduce a causal structure discovery to unravel these interdependent relationships among physiology, cognition, extrinsic factors, and ASR errors. Then, we employ causal quantification to measure each factor's impact on children's ASR. We extend the analysis to fine-tuned models to identify which factors are mitigated by fine-tuning and which remain largely unaffected. Experiments on Whisper and Wav2Vec2.0 demonstrate the generalizability of our findings across different ASR systems.

artificial intelligence, machine learning, natural language, (15 more...)

arXiv.org Artificial Intelligence

2506.00402

Genre:

Research Report > Experimental Study (0.94)
Research Report > New Finding (0.88)

Industry: Health & Medicine (1.00)

Technology:

Information Technology > Artificial Intelligence > Speech > Speech Recognition (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning (1.00)

Reward-Driven Interaction: Enhancing Proactive Dialogue Agents through User Satisfaction Prediction

Shen, Wei, He, Xiaonan, Zhang, Chuheng, Zhang, Xuyun, Xu, Xiaolong, Dou, Wanchun

arXiv.org Artificial Intelligence, May-27-2025

Reward-driven proactive dialogue agents require precise estimation of user satisfaction as an intrinsic reward signal to determine optimal interaction strategies. Specifically, this framework triggers clarification questions when detecting potential user dissatisfaction during interactions in the industrial dialogue system. Traditional approaches typically train a neural network model on weak labels generated by a simple model trained on user actions after the current turn. However, existing methods suffer from two critical limitations in real-world scenarios: (1) Noisy Reward Supervision: dependence on weak labels derived from post-hoc user actions introduces bias, particularly failing to capture satisfaction signals in ASR-error-induced utterances; (2) Long-Tail Feedback Sparsity: the power-law distribution of user queries causes reward prediction accuracy to drop in low-frequency domains. The noise in the weak labels and the power-law distribution of user utterances make it hard for the model to learn good representations of user utterances and sessions. To address these limitations, we propose two auxiliary tasks to improve the representation learning of user utterances and sessions that enhance user satisfaction prediction. The first one is a contrastive self-supervised learning task, which helps the model learn the representation of rare user utterances and identify ASR errors. The second one is a domain-intent classification task, which aids the model in learning the representation of user sessions from long-tailed domains and improving the model's performance on such domains. The proposed method is evaluated on DuerOS, demonstrating significant improvements in the accuracy of error recognition on rare user utterances and long-tailed domains.

machine learning, natural language, user utterance, (19 more...)

arXiv.org Artificial Intelligence

2505.18731

Country: Asia > China (0.14)

Genre: Research Report (1.00)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Personal Assistant Systems (1.00)
Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.95)
Information Technology > Artificial Intelligence > Speech > Speech Recognition (0.94)

Not All Errors Are Equal: Investigation of Speech Recognition Errors in Alzheimer's Disease Detection

Kang, Jiawen, Li, Junan, Li, Jinchao, Wu, Xixin, Meng, Helen

arXiv.org Artificial Intelligence, Dec-9-2024

Automatic Speech Recognition (ASR) plays an important role in speech-based automatic detection of Alzheimer's disease (AD). However, recognition errors could propagate downstream, potentially impacting the detection decisions. Recent studies have revealed a non-linear relationship between word error rates (WER) and AD detection performance, where ASR transcriptions with notable errors could still yield AD detection accuracy equivalent to that based on manual transcriptions. This work presents a series of analyses to explore the effect of ASR transcription errors in BERT-based AD detection systems. Our investigation reveals that not all ASR errors contribute equally to detection performance. Certain words, such as stopwords, despite constituting a large proportion of errors, are shown to play a limited role in distinguishing AD. In contrast, the keywords related to diagnosis tasks exhibit significantly greater importance relative to other words. These findings provide insights into the interplay between ASR errors and the downstream detection model.
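As a rough illustration of the kind of analysis this abstract describes, the sketch below computes WER from a word-level Levenshtein alignment and splits the errors into stopword and non-stopword errors. It is a minimal sketch and not the authors' code: the function names, the tiny stopword set, and the example sentences are all hypothetical.

```python
from typing import Dict, List, Set, Tuple

Op = Tuple[str, str, str]  # (operation, reference word, hypothesis word)


def align(ref: List[str], hyp: List[str]) -> List[Op]:
    """Levenshtein-align two word lists and return the edit operations."""
    n, m = len(ref), len(hyp)
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        d[i][0] = i
    for j in range(1, m + 1):
        d[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)

    # Backtrace from the bottom-right corner, preferring match/substitution.
    ops: List[Op] = []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and d[i][j] == d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1]):
            op = "ok" if ref[i - 1] == hyp[j - 1] else "sub"
            ops.append((op, ref[i - 1], hyp[j - 1]))
            i, j = i - 1, j - 1
        elif i > 0 and d[i][j] == d[i - 1][j] + 1:
            ops.append(("del", ref[i - 1], ""))
            i -= 1
        else:
            ops.append(("ins", "", hyp[j - 1]))
            j -= 1
    return ops[::-1]


def wer(ops: List[Op]) -> float:
    """Word error rate: (substitutions + deletions + insertions) / reference length."""
    ref_len = sum(1 for op, _, _ in ops if op != "ins")
    if ref_len == 0:
        raise ValueError("reference must contain at least one word")
    return sum(1 for op, _, _ in ops if op != "ok") / ref_len


def error_breakdown(ops: List[Op], stopwords: Set[str]) -> Dict[str, int]:
    """Count errors on stopwords versus all other words."""
    counts = {"stopword": 0, "content": 0}
    for op, ref_word, hyp_word in ops:
        if op == "ok":
            continue
        # Insertions have no reference word, so classify them by the hypothesis word.
        word = hyp_word if op == "ins" else ref_word
        counts["stopword" if word in stopwords else "content"] += 1
    return counts


# Hypothetical stopword list and transcripts, for illustration only.
STOPWORDS = {"the", "a", "is", "from"}
ref = "the boy is taking a cookie from the jar".split()
hyp = "a boy is taking cookie from the car".split()
ops = align(ref, hyp)
print(wer(ops))
print(error_breakdown(ops, STOPWORDS))
```

In this toy pair there are three errors over nine reference words, two of which fall on stopwords. A breakdown like this shows how a transcript can carry a high WER while leaving most task-relevant keywords intact.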

artificial intelligence, machine learning, transcription, (17 more...)

arXiv.org Artificial Intelligence

2412.06332

Country:

Asia > China > Hong Kong (0.04)
North America > United States > Pennsylvania > Philadelphia County > Philadelphia (0.04)
North America > Canada > Quebec > Montreal (0.04)
Europe > Spain > Catalonia > Barcelona Province > Barcelona (0.04)

Genre: Research Report (1.00)

Industry: Health & Medicine > Therapeutic Area > Neurology > Alzheimer's Disease (1.00)

Technology:

Information Technology > Artificial Intelligence > Speech > Speech Recognition (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (0.51)

Interventional Speech Noise Injection for ASR Generalizable Spoken Language Understanding

Jung, Yeonjoon, Lee, Jaeseong, Choi, Seungtaek, Lee, Dohyeon, Kim, Minsoo, Hwang, Seung-won

arXiv.org Artificial Intelligence, Oct-20-2024

Recently, pre-trained language models (PLMs) have been increasingly adopted in spoken language understanding (SLU). However, automatic speech recognition (ASR) systems frequently produce inaccurate transcriptions, leading to noisy inputs for SLU models, which can significantly degrade their performance. To address this, our objective is to train SLU models to withstand ASR errors by exposing them to noises commonly observed in ASR systems, referred to as ASR-plausible noises. Speech noise injection (SNI) methods have pursued this objective by introducing ASR-plausible noises, but we argue that these methods are inherently biased towards specific ASR systems, or ASR-specific noises. In this work, we propose a novel and less biased augmentation method of introducing the noises that are plausible to any ASR system, by cutting off the non-causal effect of noises. Experimental results and analyses demonstrate the effectiveness of our proposed methods in enhancing the robustness and generalizability of SLU models against unseen ASR systems by introducing more diverse and plausible ASR noises in advance.

artificial intelligence, asr system, natural language, (16 more...)

arXiv.org Artificial Intelligence

2410.15609

Country:

North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
Asia > South Korea > Seoul > Seoul (0.04)
Asia > China > Hong Kong (0.04)
(6 more...)

Genre: Research Report > New Finding (1.00)

Technology:

Information Technology > Artificial Intelligence > Speech > Speech Recognition (1.00)
Information Technology > Artificial Intelligence > Natural Language (1.00)

Large Language Model Should Understand Pinyin for Chinese ASR Error Correction

Li, Yuang, Qiao, Xiaosong, Zhao, Xiaofeng, Zhao, Huan, Tang, Wei, Zhang, Min, Yang, Hao

arXiv.org Artificial Intelligence, Sep-20-2024

Large language models can enhance automatic speech recognition systems through generative error correction. In this paper, we propose Pinyin-enhanced GEC (PY-GEC), which leverages Pinyin, the phonetic representation of Mandarin Chinese, as supplementary information to improve Chinese ASR error correction. Our approach only utilizes synthetic errors for training and employs the one-best hypothesis during inference. Additionally, we introduce a multitask training approach involving conversion tasks between Pinyin and text to align their feature spaces. Experiments on the Aishell-1 and the Common Voice datasets demonstrate that our approach consistently outperforms GEC with text-only input. More importantly, we provide intuitive explanations for the effectiveness of PY-GEC and multitask training from two aspects: 1) increased attention weight on Pinyin features; and 2) aligned feature space between Pinyin and text hidden states.

error correction, pinyin, pinyin feature, (15 more...)

arXiv.org Artificial Intelligence

2409.13262

Country:

South America > Paraguay > Asunción > Asunción (0.04)
South America > Chile > Santiago Metropolitan Region > Santiago Province > Santiago (0.04)
Europe > United Kingdom > Scotland > City of Edinburgh > Edinburgh (0.04)
(3 more...)

Genre: Research Report (0.64)

Technology:

Information Technology > Artificial Intelligence > Speech > Speech Recognition (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.70)

Keyword-Aware ASR Error Augmentation for Robust Dialogue State Tracking

Lee, Jihyun, Im, Solee, Lee, Wonjun, Lee, Gary Geunbae

arXiv.org Artificial Intelligence, Sep-10-2024

Dialogue State Tracking (DST) is a key part of task-oriented dialogue systems, identifying important information in conversations. However, its accuracy drops significantly in spoken dialogue environments due to named entity errors from Automatic Speech Recognition (ASR) systems. We introduce a simple yet effective data augmentation method that targets those entities to improve the robustness of the DST model. Our novel method can control the placement of errors using keyword-highlighted prompts while introducing phonetically similar errors. As a result, our method generates sufficient error patterns on keywords, leading to improved accuracy in noisy and low-accuracy ASR environments.

arxiv preprint arxiv, asr error, augmentation, (12 more...)

arXiv.org Artificial Intelligence

2409.06263

Country: Europe > France > Provence-Alpes-Côte d'Azur > Bouches-du-Rhône > Marseille (0.04)

Genre: Research Report (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Discourse & Dialogue (1.00)
Information Technology > Artificial Intelligence > Natural Language > Text Processing (0.67)

MEDSAGE: Enhancing Robustness of Medical Dialogue Summarization to ASR Errors with LLM-generated Synthetic Dialogues

Binici, Kuluhan, Kashyap, Abhinav Ramesh, Schlegel, Viktor, Liu, Andy T., Dwivedi, Vijay Prakash, Nguyen, Thanh-Tung, Gao, Xiaoxue, Chen, Nancy F., Winkler, Stefan

arXiv.org Artificial Intelligence, Sep-5-2024

Automatic Speech Recognition (ASR) systems are pivotal in transcribing speech into text, yet the errors they introduce can significantly degrade the performance of downstream tasks like summarization. This issue is particularly pronounced in clinical dialogue summarization, a low-resource domain where supervised data for fine-tuning is scarce, necessitating the use of ASR models as black-box solutions. Employing conventional data augmentation for enhancing the noise robustness of summarization models is not feasible either due to the unavailability of sufficient medical dialogue audio recordings and corresponding ASR transcripts. To address this challenge, we propose MEDSAGE, an approach for generating synthetic samples for data augmentation using Large Language Models (LLMs). Specifically, we leverage the in-context learning capabilities of LLMs and instruct them to generate ASR-like errors based on a few available medical dialogue examples with audio recordings. Experimental results show that LLMs can effectively model ASR noise, and incorporating this noisy data into the training process significantly improves the robustness and accuracy of medical dialogue summarization systems. This approach addresses the challenges of noisy ASR outputs in critical applications, offering a robust solution to enhance the reliability of clinical dialogue summarization.

asr model, dialogue, transcript, (12 more...)

arXiv.org Artificial Intelligence

2408.14418

Country:

South America > Colombia > Meta Department > Villavicencio (0.04)
Europe > United Kingdom > England > Greater Manchester > Manchester (0.04)
Europe > Ireland > Leinster > County Dublin > Dublin (0.04)
(2 more...)

Genre: Research Report > New Finding (0.88)

Industry: Health & Medicine > Health Care Technology > Medical Record (0.46)

Technology:

Information Technology > Artificial Intelligence > Speech > Speech Recognition (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)

Exploring the Potential of Multimodal LLM with Knowledge-Intensive Multimodal ASR

Wang, Minghan, Wang, Yuxia, Vu, Thuy-Trang, Shareghi, Ehsan, Haffari, Gholamreza

arXiv.org Artificial Intelligence, Jun-16-2024

Recent advancements in multimodal large language models (MLLMs) have made significant progress in integrating information across various modalities, yet real-world applications in educational and scientific domains remain challenging. This paper introduces the Multimodal Scientific ASR (MS-ASR) task, which focuses on transcribing scientific conference videos by leveraging visual information from slides to enhance the accuracy of technical terminologies. Traditional metrics like WER fall short in assessing performance accurately, prompting the proposal of severity-aware WER (SWER), which considers the content type and severity of ASR errors. We propose the Scientific Vision Augmented ASR (SciVASR) framework as a baseline method, enabling MLLMs to improve transcript quality through post-editing. Evaluations of state-of-the-art MLLMs, including GPT-4o, show a 45% improvement over speech-only baselines, highlighting the importance of multimodal information integration.
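To make concrete why a plain WER can hide the severity of an error, here is a minimal sketch of a weighted word error rate in which mistakes on designated terms cost more. This is an assumption-laden illustration and not the SWER definition from the paper: the weighting scheme, the function name `weighted_wer`, and the example weights are all hypothetical.

```python
from typing import Dict, List


def weighted_wer(ref: List[str], hyp: List[str],
                 weight: Dict[str, float], default: float = 1.0) -> float:
    """Weighted edit distance over words, normalised by total reference weight.

    Substituting or deleting a reference word costs that word's weight;
    inserting a hypothesis word costs `default`. (Hypothetical scheme.)
    """
    if not ref:
        raise ValueError("reference must contain at least one word")
    w = [weight.get(tok, default) for tok in ref]
    n, m = len(ref), len(hyp)
    d = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        d[i][0] = d[i - 1][0] + w[i - 1]      # delete every reference word
    for j in range(1, m + 1):
        d[0][j] = d[0][j - 1] + default       # insert every hypothesis word
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            match_or_sub = d[i - 1][j - 1] + (0.0 if ref[i - 1] == hyp[j - 1] else w[i - 1])
            delete = d[i - 1][j] + w[i - 1]
            insert = d[i][j - 1] + default
            d[i][j] = min(match_or_sub, delete, insert)
    return d[n][m] / sum(w)


# Hypothetical weights: a technical term counts three times as much as other words.
weights = {"transformer": 3.0}
ref = "we use the transformer encoder".split()
minor = "we use a transformer encoder".split()     # error on a function word
severe = "we use the transform encoder".split()    # error on the technical term

print(weighted_wer(ref, minor, weights))
print(weighted_wer(ref, severe, weights))
print(weighted_wer(ref, minor, {}), weighted_wer(ref, severe, {}))
```

With empty weights both hypotheses score the same, since each contains one substitution in five words. With the term weight applied, the hypothesis that corrupts the technical term scores worse.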

information, mismatch, transcript, (15 more...)

arXiv.org Artificial Intelligence

2406.1088

Country:

North America > United States > California > Los Angeles County > Los Angeles (0.14)
Europe > Switzerland > Zürich > Zürich (0.14)
North America > United States > Hawaii > Honolulu County > Honolulu (0.04)
(14 more...)

Genre: Research Report > New Finding (0.46)

Industry: Education (0.46)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.51)

Speech Emotion Recognition with ASR Transcripts: A Comprehensive Study on Word Error Rate and Fusion Techniques

Li, Yuanchao, Bell, Peter, Lai, Catherine

arXiv.org Artificial Intelligence, Jun-12-2024

Text data is commonly utilized as a primary input to enhance Speech Emotion Recognition (SER) performance and reliability. However, the reliance on human-transcribed text in most studies impedes the development of practical SER systems, creating a gap between in-lab research and real-world scenarios where Automatic Speech Recognition (ASR) serves as the text source. Hence, this study benchmarks SER performance using ASR transcripts with varying Word Error Rates (WERs) on well-known corpora: IEMOCAP, CMU-MOSI, and MSP-Podcast. Our evaluation includes text-only and bimodal SER with diverse fusion techniques, aiming for a comprehensive analysis that uncovers novel findings and challenges faced by current SER research. Additionally, we propose a unified ASR error-robust framework integrating ASR error correction and modality-gated fusion, achieving lower WER and higher SER results compared to the best-performing ASR transcript. This research is expected to provide insights into SER with ASR assistance, especially for real-world applications.

asr transcript, recognition, transcript, (11 more...)

arXiv.org Artificial Intelligence

2406.08353

Country: Asia > South Korea > Gyeonggi-do > Suwon (0.04)

Genre: Research Report > New Finding (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Information Fusion (0.87)
Information Technology > Artificial Intelligence > Speech > Speech Recognition (0.87)
(2 more...)
